Set up

Load the libraries:

library(leaflet)
library(tmap)
library(sf)
library(tidyverse)

Get the data:

dc_crimes <-
  read_csv('data/raw/dc_crimes.csv')

census <-
  st_read('data/raw/census/census.shp')
## Reading layer `census' from data source 
##   `/Users/brianevans/Library/Mobile Documents/com~apple~CloudDocs/ppol/data/raw/census/census.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3492 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -83.67539 ymin: 36.54085 xmax: -74.98628 ymax: 39.72304
## Geodetic CRS:  NAD83

Managing memory

Before undertaking a GIS analysis, it’s typically good to ensure that all data take up as little memory as possible. This means reducing the data to the extent of interest and simplifying where possible (Make sure your data are as simple as possible and as complex as necessary!).

  Subset the census data to just data in which the state_name is “DC”. Assign the object to your global environment with the name dc_census. Plot the data such that the fill color is defined by the size of the population.

We can explore the size of the dc_census object using the function object_size(). I like to change the format of the resultant object to “Mb” because my brain has a hard time with big numbers:

dc_census %>% 
  object.size() %>% 
  format('Mb')
## [1] "0.8 Mb"

One way that we can reduce the size of the file is by reducing the number of columns. Let’s see how the size of the object changes by subsetting the data to the fields GEOID, edu, income, and population:

dc_census %>% 
  select(GEOID, edu, income, population) %>% 
  object.size() %>% 
  format('Mb')
## [1] "0.7 Mb"

We’ve reduced the size of the data by about 9.5%, which is a fair improvement. Let’s assign it to our global environment:

dc_census_reduced <-
  dc_census %>% 
  select(GEOID, edu, income, population)

Let’s explore the crime data. We can look at the structure of the data using str():

str(dc_crimes)
## spec_tbl_df [57,400 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id           : chr [1:57400] "19186086-01" "19186116-01" "19186130-01" "19186160-01" ...
##  $ longitude    : num [1:57400] -77 -77 -77 -77 -77 ...
##  $ latitude     : num [1:57400] 38.9 38.9 38.9 38.9 38.9 ...
##  $ date_time    : POSIXct[1:57400], format: "2019-10-18 01:31:08" "2019-10-18 01:36:33" ...
##  $ offense      : chr [1:57400] "theft f/auto" "robbery" "theft/other" "assault w/dangerous weapon" ...
##  $ offense_group: chr [1:57400] "property" "violent" "property" "violent" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_character(),
##   ..   longitude = col_double(),
##   ..   latitude = col_double(),
##   ..   date_time = col_datetime(format = ""),
##   ..   offense = col_character(),
##   ..   offense_group = col_character()
##   .. )

… and summary using summary():

summary(dc_crimes)
##       id              longitude         latitude    
##  Length:57400       Min.   :-77.11   Min.   :38.81  
##  Class :character   1st Qu.:-77.03   1st Qu.:38.89  
##  Mode  :character   Median :-77.01   Median :38.91  
##                     Mean   :-77.01   Mean   :38.91  
##                     3rd Qu.:-76.99   3rd Qu.:38.92  
##                     Max.   :-76.91   Max.   :38.99  
##    date_time                     offense          offense_group     
##  Min.   :2019-10-18 01:31:08   Length:57400       Length:57400      
##  1st Qu.:2020-03-19 18:04:34   Class :character   Class :character  
##  Median :2020-10-10 10:00:19   Mode  :character   Mode  :character  
##  Mean   :2020-10-12 12:53:22                                        
##  3rd Qu.:2021-05-03 06:03:02                                        
##  Max.   :2021-10-18 00:37:06

We are interested in looking at the violent crimes in Washington DC in 2020.

Let’s explore how crimes are classified (note the vague names):

table(dc_crimes$offense)
## 
##                      arson assault w/dangerous weapon 
##                         16                       3240 
##                   burglary                   homicide 
##                       2553                        395 
##        motor vehicle theft                    robbery 
##                       6502                       4042 
##                  sex abuse               theft f/auto 
##                        343                      17556 
##                theft/other 
##                      22753
table(dc_crimes$offense_group)
## 
## property  violent 
##    49380     8020

Before undertaking a GIS analysis, it’s typically good to ensure that all data take up as little memory as possible. This means reducing the data to the extent of interest and simplifying where possible (Make sure your data are as simple as possible and as complex as necessary!).

  Filter the data to just violent crimes that occurred in 2020. Assign the filtered object to your global environment with the name violent_crimes. Compare the size of the original object with that of the filtered object.

With our file size reduced, we can now make it a spatial object. The CRS in which the data were recorded (which is EPSG 4326) defines the CRS of the initial spatial object.

crimes_sp <-
  st_as_sf(
    violent_crimes,
    coords = c('longitude', 'latitude'),
    crs = 4326)

Now it’s time to transform the CRS of our files. Let’s project them both as EPSG 5070.

  In a single purrr::map(), transform the projections of both objects as a list and assign the objects to your global environment as census_prj and crimes_prj.

Now we’re ready to join our data. We can join with the function st_join. Notice that these the order of operations matter. How does this …

st_join(
  census_prj,
  crimes_prj)
## Simple feature collection with 3993 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1610777 ymin: 1913781 xmax: 1629412 ymax: 1936182
## Projected CRS: NAD83 / Conus Albers
## First 10 features:
##           GEOID       edu income population          id           date_time
## 1   11001010900 0.4475939  32143       3470 20000357-01 2020-01-01 11:44:05
## 1.1 11001010900 0.4475939  32143       3470 20005086-01 2020-01-09 14:58:48
## 1.2 11001010900 0.4475939  32143       3470 20006833-01 2020-01-12 04:45:48
## 1.3 11001010900 0.4475939  32143       3470 20007719-01 2020-01-13 19:06:30
## 1.4 11001010900 0.4475939  32143       3470 20015338-01 2020-01-26 12:12:27
## 1.5 11001010900 0.4475939  32143       3470 20028849-01 2020-02-16 23:15:04
## 1.6 11001010900 0.4475939  32143       3470 20030751-01 2020-02-20 02:04:37
## 1.7 11001010900 0.4475939  32143       3470 20041144-01 2020-03-07 22:35:01
## 1.8 11001010900 0.4475939  32143       3470 20051843-01 2020-03-28 05:05:17
## 1.9 11001010900 0.4475939  32143       3470 20059121-01 2020-04-14 23:50:39
##                        offense offense_group                       geometry
## 1   assault w/dangerous weapon       violent MULTIPOLYGON (((1620533 191...
## 1.1 assault w/dangerous weapon       violent MULTIPOLYGON (((1620533 191...
## 1.2                    robbery       violent MULTIPOLYGON (((1620533 191...
## 1.3                    robbery       violent MULTIPOLYGON (((1620533 191...
## 1.4 assault w/dangerous weapon       violent MULTIPOLYGON (((1620533 191...
## 1.5 assault w/dangerous weapon       violent MULTIPOLYGON (((1620533 191...
## 1.6                    robbery       violent MULTIPOLYGON (((1620533 191...
## 1.7 assault w/dangerous weapon       violent MULTIPOLYGON (((1620533 191...
## 1.8                    robbery       violent MULTIPOLYGON (((1620533 191...
## 1.9                    robbery       violent MULTIPOLYGON (((1620533 191...

… differ from this?

st_join(
  crimes_prj,
  census_prj)
## Simple feature collection with 4001 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1612036 ymin: 1916607 xmax: 1629292 ymax: 1935532
## Projected CRS: NAD83 / Conus Albers
## # A tibble: 4,001 × 9
##    id          date_time           offense         offense_group          geometry
##  * <chr>       <dttm>              <chr>           <chr>               <POINT [m]>
##  1 20000205-01 2020-01-01 04:25:41 robbery         violent       (1623818 1930744)
##  2 20000179-01 2020-01-01 04:35:55 assault w/dang… violent       (1623369 1921838)
##  3 20000202-01 2020-01-01 04:53:33 assault w/dang… violent       (1620690 1927035)
##  4 20000232-01 2020-01-01 06:07:24 assault w/dang… violent       (1623821 1927168)
##  5 20000268-01 2020-01-01 07:21:17 robbery         violent       (1625764 1926691)
##  6 20000357-01 2020-01-01 11:44:05 assault w/dang… violent       (1622738 1917506)
##  7 20000409-01 2020-01-01 13:24:30 robbery         violent       (1623530 1922956)
##  8 20000417-01 2020-01-01 14:05:17 robbery         violent       (1620148 1926027)
##  9 20000483-01 2020-01-01 17:29:07 robbery         violent       (1617620 1933535)
## 10 20000596-01 2020-01-01 20:04:25 assault w/dang… violent       (1623058 1926786)
## # … with 3,991 more rows, and 4 more variables: GEOID <chr>, edu <dbl>,
## #   income <dbl>, population <dbl>

If we wanted to summarize crimes by census tract, we’d either want to use the multipolygon object:

st_join(
  census_prj,
  crimes_prj) %>% 
  group_by(GEOID) %>% 
  summarize(n = n())
## Simple feature collection with 179 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1610777 ymin: 1913781 xmax: 1629412 ymax: 1936182
## Projected CRS: NAD83 / Conus Albers
## # A tibble: 179 × 3
##    GEOID           n                                                    geometry
##    <chr>       <int>                                          <MULTIPOLYGON [m]>
##  1 11001000100    11 (((1615776 1925234, 1615813 1925290, 1616278 1925438, 1616…
##  2 11001000201     2 (((1614701 1926320, 1614687 1926397, 1614726 1926406, 1614…
##  3 11001000202    11 (((1614781 1925546, 1614779 1925559, 1614777 1925583, 1614…
##  4 11001000300     2 (((1614218 1927310, 1614239 1927318, 1614255 1927323, 1614…
##  5 11001000400     2 (((1614952 1927603, 1614952 1927605, 1614952 1927607, 1614…
##  6 11001000501     3 (((1616249 1927355, 1616254 1927356, 1616257 1927359, 1616…
##  7 11001000502     4 (((1615344 1928620, 1615334 1928676, 1615345 1928675, 1615…
##  8 11001000600     2 (((1614244 1929685, 1614286 1929694, 1614364 1929709, 1614…
##  9 11001000701     3 (((1614178 1927835, 1614181 1928071, 1614183 1928298, 1614…
## 10 11001000702     2 (((1614201 1927403, 1614223 1927408, 1614223 1927411, 1614…
## # … with 169 more rows

Or, change it to a tibble and then summarize the crime information:

st_join(
  census_prj,
  crimes_prj) %>% 
  as_tibble() %>% 
  group_by(GEOID) %>% 
  summarize(n = n())
## # A tibble: 179 × 2
##    GEOID           n
##    <chr>       <int>
##  1 11001000100    11
##  2 11001000201     2
##  3 11001000202    11
##  4 11001000300     2
##  5 11001000400     2
##  6 11001000501     3
##  7 11001000502     4
##  8 11001000600     2
##  9 11001000701     3
## 10 11001000702     2
## # … with 169 more rows

Notice a difference? We should always use the method that uses the least amount of memory!

  Calculate the number of crimes per census tract and generate a map of the data, with census tracts colored by the number of violent crimes in 2020.

With our data joined, it’s a good time to simplify the file down to a more reasonable size:

census_simple <-
  st_join(
    census_prj,
    crimes_prj) %>% 
  rmapshaper::ms_simplify(keep = 0.05)

How did that affect our object size?

st_join(
  census_prj,
  crimes_prj) %>% 
  object.size() %>% 
  format('Mb')
## [1] "13.7 Mb"
census_simple %>% 
  object.size() %>% 
  format('Mb')
## [1] "4 Mb"

A massive improvement!

Any thoughts on why we’d want to simplify this object after joining the data?

  Modify the DC Census data until the size of the resultant object is less than 4 Mb. Assign the object to your global environment as crimes_simpler. Then create a map where each census tract is colored by the number of violent crimes per 1000 people in 2020.

Our last step in saving memory is reducing the number of files in our global environment:

rm(dc_crimes,
   dc_census,
   violent_crimes,
   dc_census_reduced,
   crimes_sp,
   census_simple,
   census_prj)

tmap

We’ve so far worked with base R plotting and ggPlots (the latter being a major improvement). We can take our plots a major step forward by using the tmap package.

Before we can plot in tmap though, we first have to transform the CRS of our objects to 4326.

  In a single map statement, transform crimes_prj and census_simpler to EPSG 4326. Save to your environment as crimes_4326 and census_4326, then remove crimes_prj and census_simpler from your global environment.

## <environment: R_GlobalEnv>

To start a tmap, we first have to specify which type of map we want to plot. We do this with the function tmap_mode():

tmap_mode('plot')

Each tmap object is generated in two steps. First you initialize the data you want to use, then you specify how you want to display the data:

tm_shape(census_4326) +
  tm_polygons()

Like ggplot, we elements to our map using the + pipe. The tmap package is sort of a weird blend of ggPlot and base R plotting though. To specify a color, we use the col = argument in tm_polygons():

tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000')

We can add a basemap to tmap. We’ll specify the bounding box of our shape and the function tmaptools::read_osm() to get an OpenStreetMap:

basemap <-
  st_bbox(census_4326) %>% 
  tmaptools::read_osm()

Each layer is added by a new call to tm_shape():

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000')

That legend position is now pretty terrible. We can specify layout using the function tm_layout. This works similarly to theme() in ggPlot (but is a bit more clunky, in my opinion).

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_layout(legend.outside = TRUE)

We have two options for adding points to our plot. We can add points by type of offense. Let’s remind ourselves of the types of violent crimes in the DC data:

crimes_4326 %>% 
  pull('offense') %>% 
  unique()
## [1] "robbery"                    "assault w/dangerous weapon"
## [3] "sex abuse"                  "homicide"

  Filter crimes_4326 to just offenses classified as “robbery” and assign it to your global environment as “robberies”.

Let’s add robberies using the tm_dots() function:

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_shape(robberies) +
  tm_dots() +
  
  tm_layout(legend.outside = TRUE)

The default dot size is really small. We can make them bigger using the size = argument (see ?tmdots):

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_shape(robberies) +
  tm_dots(size = 0.15) +
  
  tm_layout(legend.outside = TRUE)

… and add some transparency too, using the alpha = argument:

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_shape(robberies) +
  tm_dots(size = 0.15,
          alpha = 0.4) +
  
  tm_layout(legend.outside = TRUE)

  You can add markers to a map using the tm_markers function. Without creating a new object, add markers representing homicides in Washington DC in 2020.

Perhaps more useful is coloring the dots by the type of incident.

  Filter the crimes data to where the offense is either robberies or homicides (but don’t assign an object to your global environment).

## Simple feature collection with 2194 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -77.10262 ymin: 38.81347 xmax: -76.91175 ymax: 38.98272
## Geodetic CRS:  WGS 84
## # A tibble: 2,194 × 5
##    id          date_time           offense offense_group             geometry
##  * <chr>       <dttm>              <chr>   <chr>                  <POINT [°]>
##  1 20000205-01 2020-01-01 04:25:41 robbery violent       (-76.96479 38.93453)
##  2 20000268-01 2020-01-01 07:21:17 robbery violent       (-76.95198 38.89564)
##  3 20000409-01 2020-01-01 13:24:30 robbery violent       (-76.98609 38.86694)
##  4 20000417-01 2020-01-01 14:05:17 robbery violent       (-77.01757 38.89979)
##  5 20000483-01 2020-01-01 17:29:07 robbery violent       (-77.02908 38.96992)
##  6 20000700-01 2020-01-01 23:34:57 robbery violent        (-76.9943 38.92558)
##  7 20000725-01 2020-01-02 12:54:00 robbery violent       (-76.92923 38.89458)
##  8 20001102-01 2020-01-02 18:06:38 robbery violent       (-77.04168 38.91132)
##  9 20001173-01 2020-01-02 19:04:57 robbery violent       (-77.02422 38.91807)
## 10 20001235-01 2020-01-02 21:14:07 robbery violent       (-76.99308 38.92451)
## # … with 2,184 more rows

Let’s filter the data to robberies and homicides and color the points by the type of incident:

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_shape(
    crimes_4326 %>% 
      filter(offense %in% c('robbery', 'homicide'))) +
  tm_dots(
    col = 'offense',
    size = 0.15) +
  
  tm_layout(legend.outside = TRUE)

The default colors here are terrible! We can specify out own colors using the palette function:

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_shape(
    crimes_4326 %>% 
      filter(offense %in% c('robbery', 'homicide'))) +
  tm_dots(
    col = 'offense',
    palette = c(robbery = 'blue', homicide = 'red'),
    size = 0.15) +
  
  tm_layout(legend.outside = TRUE)

And let’s bring alpha = back in to see both levels when the data overlap:

tm_shape(basemap) +
  tm_rgb() +
  
  tm_shape(census_4326) +
  tm_polygons(col = 'crimes_per_1000') +
  
  tm_shape(
    crimes_4326 %>% 
      filter(offense %in% c('robbery', 'homicide'))) +
  tm_dots(
    col = 'offense',
    palette = c(robbery = 'blue', homicide = 'red'),
    size = 0.15,
    alpha = 0.5) +
  
  tm_layout(legend.outside = TRUE)